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1. Introduction 
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Graphic Processing Units (GPUs) offer a comput- 
ing architecture well suited for LatticeQCD applica- 
tions. Consequently, there is an on-going software- 
and algorithm development in order to incorporate 
GPUs effectively into lattice simulations. See for ex- 
ample ijl], ^ ^ Q]. These applications are developed 
and carried out predominantly on NVIDIA hardware, 
consistently using the NVIDIA exclusive CUDA lan- 
guage 1^ for the interaction with the GPU. Using 
GPUs is attractive because they have good price-per- 
flop (e.g. 2,0 €/ Gflop on a NVIDIA GTX 580 
and even 0,4 €/ Gflop on an AMD 6970 [||]) and 
performance-per-watt ratios. 

Recently, a new cluster was introduced at Frank- 
furt University, the "LOEWE-CSC" [01. It is dedicated 

to high-performance computing, but contrary to existing clusters it solely consists of AMD hard- 
ware. A sketch of its infrastructure is shown in Fig. |l|. A striking feature is its heterogeneous 
architecture: The majority of its compute nodes each hold two 12-core AMD Magny-Cours Cen- 
tral Processing Units (CPUs) and one AMD RADEON 5870 GPU. The LOEWE-CSC is ranked 
22 in the Top500 list of supercomputers [||] and rank 10 in the Green500 list of energy-efficient 
supercomputers (with 718 MflopsAVatt) [^. 

However, presently existing GPU appplications are mostly suitable for NVIDIA hardware. 



Figure 1: LOEWE-CSC 



Other than using graphic application programming interfaces (APIs) like OpenGL [10|, the only 



tool available to use AMD GPUs for general purposes is OpenCL [ 1 1 1. This is an open standard for 
parallel programming. Furthermore, it is explicitly designed for heterogeneous (or hybrid) systems, 
thus being well suited for the LOEWE-CSC as well as other, non-GPU platforms. Implementations 
of OpenCL can be found both from AMD (AMD Accelerated Parallel Processing (APP) [|l^], 
formerly ATI Stream SDK) and NVIDIA (as part of CUDA). 

The first lattice simulations in OpenCL were performed in [|l]] with staggered fermions. On 
NVIDIA hardware, a significantly lower performance (25% on C1060 and 60% on S2050) of 
OpenCL was reported compared to CUDA for Hybrid Monte Carlo (HMC) updates. An AMD 
GPU was also considered. Here, OpenCL performance is better than on Nvidia hardware, but still 
below CUDA results (50% less performance on an AMD 5870 in OpenCL than on a S2050 in 
CUDA). 



2. OpenCL 

In the following, we will present the general ideas of OpenCL and explain important terms. 



For more information see [12]. The generic concept of an OpenCL application consists of a 
"host" program and several "compute devices" (see Fig. They live together on a so called 
"platform". The host controls memory management and calculations carried out on the devices. 
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Figure 2: OpenCL Concept 



while each device may either be a GPU, a CPU or any other kind of supported compute de- 
vice. A single device then consists of a bunch of "compute units" (on a multicore CPU this 
would be a single core) which in turn can consist of one or more "processing elements" that carry 
out the actual computations. This of course reflects the ar- 
chitecture of CPUs. On CPUs a compute unit (i.e. a sin- 
gle core) and a processing element may also be the same, 
although this depends on the specific model and OpenCL 
implementation used. It should be noted that on a CPU it 
is possible to split up a multicore CPU into several devices, 
effectively grouping the cores suited for specific tasks. 

Furthermore, execution commands for OpenCL func- 
tions ("kernels") are scheduled in one or more "command queues" via the host. The queue launches 
execution of kernels on a specific device and also handles memory commands. Synchronization on 
this level is possible only within a command queue. 

Central objects of any OpenCL application are the kernels. These have to be written in the 
OpenCL C programming language, which is based on a subset of C 9 9, the C standard. Optionally, 
"native kernels" can be included from libraries. Kernels are executed using an up to 3-dimensional 
index space, where each index can be mapped on an instance of the kernel, which in turn are called 
"work items". Several work items make up a "work group", which allows for synchronization 
between work items. Kernels can access memory on various levels, ranging from "global" (e.g. 
the main memory on the GPU) to "private" (e.g. General Purpose Registers (GPRs) of a GPU 
stream-core). Besides the nomenclature, the setup is very similiar to CUDA. 

It is important to emphasize here that OpenCL allows for data- as well as for task parallel 
applications, meaning that it is on the one hand possible to perform SIMD computations and on 
the other hand to perform different (possibly independent) tasks in a parallel fashion, providing 
OpenMP Jli] ] or UNIX's pthreads functionality automatically. 

In order to have a hardware-independent programming model, the actual OpenCL program is 
compiled and built at runtime of the (host) application. This is done using an OpenCL inherent 
compiler. 



3. Implementation 



The physical problems we are currently interested in are investigations of the quark gluon 



plasma (QGP) and the thermal transition of QCD with dynamical fermions [ |15| , |16[ [ITp as well 
as in pure gauge theory (PGT). These have yet been carried out mainly relying on the tmlqcd 



program suite [ |18| ] and an application written with QDP++ []19[]. Several features shall thus be 
provided by the OpenCL application: On the PGT side we need a SU(3) heatbath algorithm [|^], 
whereas on the fermionic side we require an HMC algorithm including standard features (even-odd 
preconditioning, 2MN integrator, multiple integration timescales) for Nf = 2 twisted-mass Wilson 



fermions [18, 21 1. Also, ILDG-compatible I/O is required [22]. 

In order to account for the fundamentally different concept of OpenCL we decided to write 
an all new program. The implementation of the desired features mentioned above resulted in four 
executables, providing the possiblity to generate gauge configurations as well as calculate phys- 
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ical observables of interest for pure gauge theory and dynamical fermions. All calculations are 
performed in OpenCL. In the following we will go to some details describing the concrete imple- 
mentation. 
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(a) OpenCL module classes 
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Inverter ^ 

(b) Gaugefield classes 



Figure 3: Structure of opencl modules- and gaugefield classes 

The host program was set up in C++ in order to implement independent program parts easily 
using C++ classes and to have extension capabilities in a natural way. 

The central object is the class gaugefield, which incorporates the initialization of OpenCL 
and holds an application-specific number of opencl_device-objects. As the name indicates, 
the latter class contains all compute device-related parts, such as kernels or memory objects, and 
eventually executes the kernels on a specific device. The class gaugefield is also dedicated to 
synchronize the physical gauge field when it is used on several devices. 

The specific physical problems were implemented as child classes of gaugefield and 
opencl_module, containing the problem-related functionality (see Fig. ||). Furthermore, for 
each physical problem a number of "tasks" can be defined which then again can contain a num- 
ber of device objects to carry out this task. For example, the inverter executable essentially 
performs two tasks, the inversion of the fermion matrix and the calculation of correlators. This 
concept will prove useful when looking at hybrid applications. 

As was mentioned in section ^ the OpenCL environment has to go through certain initializa- 
tion processes before the actual calculations can be carried out. A schematic flow of this process is 
shown in Fig. ^. The major part of the initialization is spent on generating the kernels. This is done 
in a couple of steps: First, the files needed for the kernel code are collected, after that an OpenCL 
"program" is compiled and linked using the OpenCL compiler This program can then be used to 

build kernels referring to functions declared with kernel within the previously read-in source 

code. 

We found it convenient to build ev- 
ery kernel as a stand alone program 
since the kernel is the object of inter- 
est. The compiler generates binary files 
which can be used to extract informa- 
tions about the kernel, e.g. GPR usage, 
which is useful for benchmarking and 
optimization. They can also be reused 
for kernel generation at a later program 
run. This speeds up the initialization 
time significantly. 



Get Platform - 



Compile /Input-i 
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OpenCL device OpenCL device 1 ^ ■ ■ ■ 

Collect Sourcefiles/Compileoptions > Init Buffers /Kerne Is 

Further Execution. . . 



Figure 4: Schematic flow of program initialization 
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One can influence the whole process at this point by means of the simulation parameters. 
The latter are typically read in at runtime of the host, so one can e.g. switch between CPU and 
GPU simply via the input file. Since the kernels are compiled only at runtime, one can pass the 
simulation parameters (e.g. NT, NS, j8, . . . ) to them as compile time pai^ameters. This is a nice way 
of avoiding many kernel arguments as well as to "hard code" the parameters into the kernel code. 
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Figure 5: Dslash performance on AMD 5870 on various even-odd preconditioned lattices (statistics are of 
ff(\Ok) for each volume). 

In fermionic applications, the most time-consuming part is the inversion of the fermion matrix 
and the non-diagonal part of the Dirac matrix ("dslash"), respectively. On GPUs, this problem is 
always bandwidth-limited and tuning is required to achieve a satisfiying amount of the maximum 
bandwidth (on an AMD 5870 this is e.g. 154 GB/s). Our (even-odd preconditioned) dslash imple- 
mentation currently performs at 45 - 60 GFlops in double precision calculations (with a bandwidth 
utilization of up to 105 GB/s) over a wide range of lattice sizes (See Fig. ^). We will give more per- 
formance results for AMD hardware (also considering memory optimizations like RECONSTRUCT 
TWELVE |0]) in a future publication. 



4. Hybrid strategies 



Device 



Device 1 



Having a hybrid system as the LOEWE-CSC at hand, the question arises how one can use this 
infrastructure effectively. Both GPU and CPU hold several advantages, qualifying them for certain 
tasks. GPUs outperform CPUs when it comes to floating point operations, whereas a CPU can in 
general operate a bigger amount of memory and a bigger cache, just to name a few. OpenCL can 
be used quite easily to distribute computations over a hybrid system. 

A typical scenario in lattice simulations 
is the iterative calculation of some observ- 
ables out of a sequence of gauge configu- 
rations. The simplest case one can have 
is an observable that does not require any 
synchronization in between its calculation. 
Given 2 devices, one can distribute two gen- 

eraUzed tasks among them, as depicted in Figure 6: Simplest hybrid strategy. 

Fig. ^ One task calculates the observable while in the meantime the other provides the ingre- 



Sync Gaugefield 
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dients required for the next iteration's calculation. Synchronization between the devices is carried 
out at the start of each iteration. 

We implemented this concept for two observables, the Nf = 2 mesonic flavour doublet corre- 
lators with quantum number Y, 

Cr = -Tr {sl{xo,x)y,YSu{xQ,x)Ty,) , (4.1) 

for which one has to provide the propagator 5„(xo,x) ~ M^^b{xo) (where M is the fermion matrix 
and b a point source at site xq), and the second order transport coefficent K of the Quark Gluon 



Plasma [|23[]. K can be extracted from the retarted propagator Gr at zero Matsubaia frequency, 



GR{Q) = 0,q)=G{0)--\qr + > 



(4.2) 



which can be calculated on a given gauge configuration by the euclidean correlator Ge according 
to 

Gr{(0 = 0,q)= Ge{(0 =0,q)=N ^e^^fc-w) (ri2(x)ri2(3')) . 



(4.3) 



Tnv is a discretization of the energy-momentum tensor using clover-plaquettes [24]. 
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Sync gauge configuration 
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(a) Mesonic correlator Cr (b) Transportcoefticient K 

Figure 7: Hybrid strategies for two observables of interest. 



The concrete implementations can be seen in Fig. ^ where we considered the case of one 
CPU and one GPU device in the system (note that e.g. on one LOEWE-CSC node, one CPU 
device has by default 2*12 = 24 cores). The assignments of the different tasks to the devices came 
quite naturally, since K"'s calculation requires much more memory ressources than the heatbath 
algorithm and the inversion of the fermion matrix is carried out faster on the GPU. 

5. Conclusions 

We presented to some detail our implementation of LatticeQCD applications using OpenCL. 
This is a quite different approach compared to other applications around, which mainly operate on 
NVIDIA hardware using CUDA. Since OpenCL is a hardware-independent programming model, 
any hardware can be used, which offer an optimal price-per-flop ratio. Especially, parallel cal- 
culations can be performed quite simply in OpenCL, allowing for applications suited for hybrid 
architectures. We gave two examples using GPU and CPU devices at the same time effectively. 
We are currently benchmarking and optimizing our code to exploit the compute powers of the 
LOEWE-CSC and AMD hardware in general, providing an alternative to NVIDIA hardware bound 
applications. 
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